About the project

In this course students get an introduction to the interesting world of data science. The tools used most in the course are RStudio, R Markdown and GitHub.

Here is the link to my personal GitHub repository.


Analysis, IODS week 2

Overview of the data

Let's read the data from the given site and then take a look at its structure.

lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)


str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166   7

We have seven variables and 166 observations. The variables are the gender and age of the respondent, attitude toward statistics, deep learning, strategic learning, surface learning and exam points. The link to the metadata is here.

Let's have a brief graphical overview of the data.

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

From the graph above we can see how the variables are distributed and how they are related to each other. The colours signal gender (blue for male, red for female). The data set has more women than men (110 females, 56 males). Most distributions do not differ much between the genders; only the attitude distribution profile is clearly different, and the means of strategic learning and surface learning also look to differ.

Here is the summary of the data.

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Linear analysis

Let's define exam points as the dependent variable. The variables with the highest correlation coefficients with exam points are attitude, strategic learning and surface learning, so they are the independent variables in our model:

\[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \beta_{3}X_{3i} + \varepsilon_{i}\] where \(Y_{i}\) is exam points, \(X_{1i}\) is attitude, \(X_{2i}\) is strategic learning and \(X_{3i}\) is surface learning.

Here are scatterplots of exam points against each regressor, with fitted regression lines:

qplot(attitude, points, data = lrn14, col = gender) + geom_smooth(method = "lm") + ggtitle("Regressor Attitude")

qplot(stra, points, data = lrn14, col = gender) + geom_smooth(method = "lm") + ggtitle("Regressor Strategic Learning")

qplot(surf, points, data = lrn14, col = gender) + geom_smooth(method = "lm") + ggtitle("Regressor Surface Learning")

As the correlation coefficients show, exam points have a positive relation with attitude and strategic learning and a negative relation with surface learning.
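The correlation-based screening of regressors can be sketched in a few lines. Here is a hypothetical illustration on synthetic data (the frame `d` and its coefficients are made up for the example; in the real analysis the frame is `lrn14`):

```r
# Rank candidate predictors by their absolute correlation with the response.
set.seed(1)
d <- data.frame(attitude = rnorm(100), deep = rnorm(100), stra = rnorm(100))
d$points <- 3 * d$attitude + 0.8 * d$stra + rnorm(100)

# correlation of each candidate column with the response
cors <- sapply(d[, c("attitude", "deep", "stra")], cor, y = d$points)

# strongest relationships first
sort(abs(cors), decreasing = TRUE)
```

Variables at the top of this ranking are the natural candidates for the regression model.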

Let's analyse our linear model \(Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \beta_{3}X_{3i} + \varepsilon_{i}\).

First, we check the model assumptions:

model <- lm(points ~ attitude + stra + surf, data = lrn14)

plot(model, which = c(1,2,5))

The residuals look to have mean zero, the points are mostly normally distributed and there are no points with significantly high leverage.

Here is the summary of the model:

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Attitude is the only statistically significant independent variable in the model; strategic and surface learning are not significant even at the 10 % level. The model has an adjusted \(R^2\) of 0.19, which means that it explains approximately 19 % of the variation in exam points. With an F-test we can check how likely it is that both the strategic and surface learning coefficients are zero.

library(lmtest)
library(sandwich)
library(car)

linearHypothesis(model, c("surf = 0", "stra = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## surf = 0
## stra = 0
## 
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    164 4641.1                           
## 2    162 4544.4  2    96.743 1.7244 0.1815

With a p-value of 0.18 we cannot reject the null hypothesis “surf = 0, stra = 0”, so we can drop those two variables from the model.

model2 <- lm(points ~ attitude, data = lrn14)

Checking the assumptions:

plot(model2, which = c(1,2,5))

Removing the variables strategic learning and surface learning does not make a big difference compared to the original model.

Summary of the model \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \varepsilon_{i}\]

summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In this model the adjusted \(R^2\) is only slightly smaller than in the original model. The estimated OLS coefficient for attitude, 3.53, has also not changed much (it was 3.40 in the original model). The parameter is statistically significant at the 0.1 % significance level. The estimate means that when attitude increases by one point, the expected increase in exam points is about 3.5.
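This interpretation can be checked mechanically with `predict()`. A small synthetic sketch (the names `d` and `fit` are illustrative, not part of the analysis above):

```r
# In a simple linear regression, raising the regressor by one unit raises
# the fitted value by exactly the estimated slope coefficient.
set.seed(2)
d <- data.frame(x = runif(50, 1, 5))
d$y <- 11 + 3.5 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

# predictions at x = 3 and x = 4
p <- predict(fit, newdata = data.frame(x = c(3, 4)))
diff(p)   # equals coef(fit)["x"] up to floating point
```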


Analysis, IODS week 3

Import data

Read the data from the local data folder.

alc <- read.csv("C:/Users/juhol/OneDrive/Documents/GitHub/IODS-project/data/alc.csv", header = T, sep = ",")

alc <- alc[,-1] # First column only duplicates the row numbers

dim(alc)
## [1] 382  35
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

The data includes 382 observations and 35 variables. It combines two datasets which are loaded from here. The site also includes the metadata, which is useful to read. The R code for the data wrangling can be found here and the data is available here.

Analysis

Packages used in this section:

library(dplyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(cowplot)
library(boot)

Let's look at the column names of our data.

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

In this analysis we are interested in the relationships between alcohol consumption and social variables. Let's choose “high use of alcohol” (binary) as the dependent variable of interest, and “extra-curricular activities” (binary), “romantic relationship” (binary), “quality of family relationships” (numeric: from 1, very bad, to 5, excellent) and “going out with friends” (numeric: from 1, very low, to 5, very high) as the social variables of interest. Let's also control for age and sex.

# Let's make a subset of the data with the variables "sex", "age", "activities", "romantic", "famrel", "goout" and "high_use".

variables_sub <- c("sex", "age", "activities", "romantic", "famrel", "goout", "high_use")

alc_sub <- select(alc, one_of(variables_sub))

str(alc_sub)
## 'data.frame':    382 obs. of  7 variables:
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Let's plot alcohol use against each of these variables.

alc_sex <- ggplot(data = alc_sub, aes(x = sex, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and sex")

alc_sex

alc_age <- ggplot(data = alc_sub, aes(x = age, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and age")

alc_age

alc_act <- ggplot(data = alc_sub, aes(x = activities, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and activities")

alc_act

alc_rom <- ggplot(data = alc_sub, aes(x = romantic, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and romantic relationship")

alc_rom

alc_family <- ggplot(data = alc_sub, aes(x = famrel, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and quality of family relationship")

alc_family

alc_goout <- ggplot(data = alc_sub, aes(x = goout, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and going out with friends")

alc_goout

It looks clear that high alcohol consumption is more common among males than females. Interestingly, in this sample the age groups from 15 to 18 (where the sample is larger) all have roughly the same share of high users. It also looks clear that among students with extra-curricular activities, heavy use is less common than in the other group. Judging by the histograms alone, better quality of family relationships seems to decrease the risk of high alcohol consumption. Not surprisingly, the students who go out with friends more also consume more alcohol.
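These visual impressions can be quantified with row-proportion tables. A sketch on a small hand-made table so that it runs standalone; with the real data the corresponding call would be `prop.table(table(alc_sub$sex, alc_sub$high_use), margin = 1)`:

```r
# Share of high users within each sex (rows sum to 1).
tab <- table(sex      = c("F", "F", "F", "M", "M"),
             high_use = c(FALSE, FALSE, TRUE, TRUE, TRUE))

round(prop.table(tab, margin = 1), 2)
```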

Let's look at this in more detail. Let's start from the GLM model \[\text{HighUse}_{i} = \beta_{0} + \beta_{1}\text{Sex}_{i} + \beta_{2}\text{Age}_{i} + \beta_{3}\text{Activities}_{i} + \beta_{4}\text{Romantic}_{i} + \beta_{5}\text{Family}_{i} + \beta_{6}\text{GoOut}_{i} + \epsilon_{i}\] and then use the backward method to find the best candidate for the model.

# First model
glm1 <- glm(high_use ~ sex + age + activities + romantic + famrel + goout, data = alc_sub, family = "binomial")

step(glm1, direction = "backward")
## Start:  AIC=402.06
## high_use ~ sex + age + activities + romantic + famrel + goout
## 
##              Df Deviance    AIC
## - romantic    1   389.00 401.00
## - age         1   389.99 401.99
## <none>            388.06 402.06
## - activities  1   390.86 402.86
## - famrel      1   397.30 409.30
## - sex         1   403.62 415.62
## - goout       1   436.48 448.48
## 
## Step:  AIC=401
## high_use ~ sex + age + activities + famrel + goout
## 
##              Df Deviance    AIC
## - age         1   390.62 400.62
## <none>            389.00 401.00
## - activities  1   391.99 401.99
## - famrel      1   397.84 407.84
## - sex         1   405.01 415.01
## - goout       1   437.32 447.32
## 
## Step:  AIC=400.62
## high_use ~ sex + activities + famrel + goout
## 
##              Df Deviance    AIC
## <none>            390.62 400.62
## - activities  1   393.80 401.80
## - famrel      1   399.15 407.15
## - sex         1   406.47 414.47
## - goout       1   442.61 450.61
## 
## Call:  glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial", 
##     data = alc_sub)
## 
## Coefficients:
##   (Intercept)           sexM  activitiesyes         famrel          goout  
##       -2.2313         0.9971        -0.4494        -0.4050         0.8118  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  377 Residual
## Null Deviance:       465.7 
## Residual Deviance: 390.6     AIC: 400.6

By Akaike information criterion the best model is the following: \[\text{HighUse}_{i} = \beta_{0} + \beta_{1}\text{Sex}_{i} + \beta_{2}\text{Activities}_{i} + \beta_{3}\text{Family}_{i} + \beta_{4}\text{GoOut}_{i} + \epsilon_{i}\]

This model has \(\text{AIC} = 400.6\).

glm_best <- glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial", 
    data = alc_sub)

summary(glm_best)
## 
## Call:
## glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial", 
##     data = alc_sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5303  -0.7884  -0.5366   0.8617   2.6530  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.2313     0.6476  -3.445  0.00057 ***
## sexM            0.9971     0.2559   3.896 9.77e-05 ***
## activitiesyes  -0.4494     0.2532  -1.775  0.07588 .  
## famrel         -0.4050     0.1396  -2.902  0.00371 ** 
## goout           0.8118     0.1229   6.603 4.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 390.62  on 377  degrees of freedom
## AIC: 400.62
## 
## Number of Fisher Scoring iterations: 4

We can calculate \(\text{McFadden's pseudo } R^2\), which represents the overall effect size of the model (more about it here):

ll.null <- glm_best$null.deviance/-2
ll.proposed <- glm_best$deviance/-2

pseudo_r_sqr <- (ll.null-ll.proposed)/ll.null

pseudo_r_sqr
## [1] 0.1611876
ll.null2 <- glm1$null.deviance/-2
ll.proposed2 <- glm1$deviance/-2

pseudo_r_sqr_org <- (ll.null2-ll.proposed2)/ll.null2

pseudo_r_sqr_org
## [1] 0.1666838

We get \(\text{McFadden's pseudo } R^2 = 0.16\), which is relatively small. The original model has \(\text{McFadden's pseudo } R^2 = 0.17\), so the extra variables do not increase the effect size significantly.

In this model all variables (including the intercept) except “extra-curricular activities” are statistically significant (by the Wald test) at the 1 % significance level; “extra-curricular activities” is significant at the 10 % level. All the coefficients are on the log-odds scale. For a binary variable the coefficient is a log odds ratio: for example, sex has \(\log(\text{odds ratio}) = 0.9971 \Leftrightarrow \text{odds ratio} = 2.71\). Of the social variables of interest, having extra-curricular activities has a negative effect on high alcohol consumption, as do good family relationships, both with log-odds estimates around -0.40 to -0.45. Going out with friends has a positive effect on heavy drinking, \(\log(\text{odds ratio}) = 0.8118 \Leftrightarrow \text{odds ratio} = 2.25\), meaning that a one-point increase in going out multiplies the odds of high use by roughly 2.25.
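The log-odds to odds-ratio conversions above are just `exp()` applied to the coefficients; a quick check with the estimates copied from the model summary:

```r
# Coefficients copied from the summary of glm_best (log-odds scale).
coefs <- c(sexM = 0.9971, activitiesyes = -0.4494,
           famrel = -0.4050, goout = 0.8118)

# Odds ratios: the multiplicative effect on the odds of high alcohol use.
round(exp(coefs), 3)
```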

We can produce a 2x2 cross tabulation of predictions versus actual values. Let's also take a visual look at how the model fits the data.

## Predicted values of our model
predicted_data <- data.frame(
  probability_of_hc = glm_best$fitted.values,
  high_consumption = alc_sub$high_use)
 
predicted_data <- mutate(predicted_data, prediction = probability_of_hc > 0.5)

predicted_data <- predicted_data[order(predicted_data$probability_of_hc, decreasing = FALSE), ]

predicted_data$rank <- 1:nrow(predicted_data)

# head(predicted_data)

# tabulate the target variable versus the predictions
table(high_use = predicted_data$high_consumption, prediction = predicted_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   254   14
##    TRUE     75   39
# Plotting
ggplot(data=predicted_data, aes(x=rank, y=probability_of_hc)) +
  geom_point(aes(color=high_consumption), alpha=1, shape=4, stroke=2) +
  xlab("Index") +
  ylab("Predicted probability of high consumption of alcohol")

We see that our model with social variables, controlling for sex, makes predictions of average quality; most observations fall on the right side. Given the small overall effect size of the model (only 0.16 as shown above), it could not be expected to generate perfect predictions. We also have to keep in mind that the model tells us nothing about causality: for example, the relationship between high alcohol consumption and going out with friends probably runs in both directions.
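The training error implied by the 2x2 table can be recomputed by hand, with the counts copied from the cross tabulation above:

```r
# Confusion matrix from the cross tabulation above:
# rows = actual high_use, columns = predicted high_use.
tab <- matrix(c(254, 75, 14, 39), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

# Misclassified observations: false positives + false negatives.
wrong <- tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]
wrong / sum(tab)   # about 0.233, matching the training loss
```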

Let's also evaluate the model with cross-validation.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Average proportion of wrong predictions in the (training) data

loss_func(predicted_data$high_consumption, predicted_data$probability_of_hc)
## [1] 0.2329843
# 10-fold cross-validation

cv <- cv.glm(data = alc_sub, cost = loss_func, glmfit = glm_best, K = 10)

# Average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2408377
# For comparing, lets try probabilities P(High_use) = 1 and 0 

# P(High_use) = 1
loss_func(predicted_data$high_consumption, prob = 1)
## [1] 0.7015707
# P(High_use) = 0
loss_func(predicted_data$high_consumption, prob = 0)
## [1] 0.2984293

The average proportion of wrong predictions in the training data is 0.23 and 10-fold cross-validation gives 0.24. Our model gives better predictions than simply setting the probability of high consumption to zero (loss 0.30) or one (loss 0.70).
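The loss function relies on R coercing the logical `class` vector to 0/1 in the subtraction; a toy check with hand-made values:

```r
# Same loss function as above: share of probabilities that land on the
# wrong side of 0.5 relative to the true class.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

class <- c(TRUE, TRUE, FALSE, FALSE)
prob  <- c(0.9, 0.2, 0.1, 0.7)   # the second and fourth are misclassified
loss_func(class, prob)           # 0.5
```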


Analysis, IODS week 4

Here are the packages used in this session:

# REMOVE ALL OBJECTS
rm(list=ls())

# access the MASS package
library(MASS)
library(corrplot)
library(tidyverse)
library(ggplot2)
library(GGally)
library(plotly)

Data

Let’s load the Boston data from the MASS package. The data is about “Housing Values in Suburbs of Boston”; more about it can be read here.

# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
ggpairs(Boston, lower = "blank", upper = list(continuous = "points", combo =
  "facethist", discrete = "facetbar", na = "na"))

The Boston dataset has 506 observations and 14 variables, all numeric. The variable chas is a dummy variable (Charles River dummy: 1 if the tract bounds the river, 0 otherwise) and the variable rad is an index (index of accessibility to radial highways).

Just by watching the plot above, it looks like the data has some linear relations. These relations can be seen in the correlation matrix.

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) 

# print the correlation matrix
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
##corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
corrplot(cor_matrix, method="circle", tl.pos = "d")

Some high correlations between variables can be seen, for example between the index of accessibility to radial highways and the full-value property-tax rate per $10,000 (0.91), between the proportion of owner-occupied units built prior to 1940 and the weighted mean of distances to five Boston employment centres (-0.75), and between nitrogen oxides concentration and the proportion of non-retail business acres per town (0.76).

LDA

For further analysis we will standardize the data. Let’s take the same kind of look at the data after standardizing it.

# center and standardize variables
boston_std <- scale(Boston)

# summaries of the scaled variables
summary(boston_std)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_std object
class(boston_std)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_std)

From the summary we can see that we have successfully standardized the data: every variable now has mean zero and unit standard deviation. Let's analyse the crime rate variable (crim) more closely.

summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

As we can see from the crime table above, the four classes contain equal (or nearly equal) numbers of observations, as they should since the breaks are quantiles. Let's divide the dataset into train and test sets so that 80% of the data belongs to the train set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Next we will fit the linear discriminant analysis on the train set. We use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2574257 0.2623762 0.2400990 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.88571224 -0.9631730 -0.150563081 -0.8745198  0.4246783
## med_low  -0.09413106 -0.2665609 -0.007331936 -0.5443643 -0.1504400
## med_high -0.38045185  0.1737650  0.247665303  0.3454622  0.1183547
## high     -0.48724019  1.0149946 -0.069385757  1.0483486 -0.4624321
##                 age        dis        rad        tax     ptratio
## low      -0.8415368  0.8564521 -0.7036343 -0.7469372 -0.49088999
## med_low  -0.2989224  0.3461849 -0.5511961 -0.4679786 -0.07406251
## med_high  0.3533641 -0.3584289 -0.3957196 -0.3021079 -0.26401188
## high      0.8374969 -0.8558762  1.6596029  1.5294129  0.80577843
##               black       lstat        medv
## low       0.3757553 -0.73954054  0.50544573
## med_low   0.3133960 -0.11330541 -0.03440868
## med_high  0.1300681 -0.01921482  0.19901930
## high     -0.7071548  0.85232768 -0.65303188
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.16145026  0.698237380 -0.88732882
## indus    0.04387196 -0.376136375  0.46027937
## chas    -0.08020820 -0.062763080  0.08393346
## nox      0.31357800 -0.798142406 -1.45281309
## rm      -0.11094000 -0.108663576 -0.17207489
## age      0.31610729 -0.129583154 -0.04722198
## dis     -0.06841277 -0.216749362  0.17448293
## rad      3.23282481  0.895792406 -0.06893070
## tax     -0.08602375  0.059117851  0.53600662
## ptratio  0.14480039  0.009524140 -0.29849017
## black   -0.17145748 -0.005977023  0.10933614
## lstat    0.13850907 -0.249121627  0.21711326
## medv     0.13739856 -0.373693189 -0.31066216
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9514 0.0362 0.0124
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col = color, pos = 3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

From the plot and summary above we can see that the index of accessibility to radial highways (rad) carries the greatest separation weight of all the variables. In LD1 its coefficient is about 3.2, while the next largest weights, around 0.32, belong to age and nitrogen oxides concentration (nox). In LD2 rad does not dominate as strongly: its coefficient is 0.90, while for example the proportion of residential land zoned for lots over 25,000 sq.ft. (zn, 0.70) and nox (-0.80) have coefficients of similar magnitude but opposite signs. In both dimensions rad is positively associated with the crime rate, so in our LD1-LD2 map higher crime rates lie towards larger LD1 values.

Next we will test how well our model predicts the crime rate. Earlier in the analysis we split the data into train and test sets and removed the categorical crime variable from the test data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19       9        2    0
##   med_low    4      15        3    0
##   med_high   0       6       14    0
##   high       0       0        1   29

From the table above we see that the model predicts the high crime rate class very well. The other classes are also predicted reasonably well, although there is some confusion between the adjacent classes low, med_low and med_high.
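As a quick numerical summary of the table, a sketch (assuming the lda.pred and correct_classes objects from above):

```r
# overall accuracy: share of test observations classified correctly
mean(lda.pred$class == correct_classes)

# per-class accuracy: diagonal of the row-normalized confusion matrix
conf <- table(correct = correct_classes, predicted = lda.pred$class)
diag(prop.table(conf, margin = 1))
```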

K-means clustering

Let's return to the standardized data with the original (continuous) crime rate variable, and analyse it with the k-means clustering method.

boston_std <- as.data.frame(boston_std)


boston_dist <- dist(boston_std)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_std, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

In the graph above the total within-cluster sum of squares drops most sharply when moving from one cluster to two, so we decide to use two clusters.

# K-means clustering
km <-kmeans(boston_std, centers = 2)



# plot the Boston dataset with clusters
#pairs(Boston, col = km$cluster)
class(km$cluster)
## [1] "integer"
ggpairs(boston_std, mapping = aes(col = as.factor(km$cluster), alpha = 0.3), lower = "blank", upper = list(continuous = "points", combo =
  "facethist", discrete = "facetbar", na = "na"))

From the plot above we see that k-means clustering works quite well here: in most cases two clusters are enough to separate the data into two reasonable distributions. For example, for nitrogen oxides concentration (nox) we see two clearly different distributions.
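One way to see which variables drive the split is to compare cluster means; a sketch, assuming the km and boston_std objects from above:

```r
# mean of each standardized variable within each cluster;
# variables with large between-cluster differences (such as nox)
# are the ones separating the two groups
aggregate(boston_std, by = list(cluster = km$cluster), FUN = mean)
```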

Bonus

For a bonus analysis we will search for a reasonable k-means clustering of the original, standardized Boston data. Then we will use the resulting clusters as the target variable in a new LDA model and find the variables with the greatest weight in the clustering.

boston_scaled_j <- scale(Boston)

boston_scaled_j <- as.data.frame(boston_scaled_j)

# calculate the total within sum of squares
twcss_j <- sapply(1:k_max, function(k){kmeans(boston_scaled_j, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss_j, geom = 'line')

# Based on the graph above, let's use 7 clusters

km_j <-kmeans(boston_scaled_j, centers = 7)

lda.fit_j <- lda(km_j$cluster ~ ., data = boston_scaled_j)

lda.fit_j
## Call:
## lda(km_j$cluster ~ ., data = boston_scaled_j)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6 
## 0.19960474 0.17984190 0.06719368 0.27272727 0.08498024 0.07509881 
##          7 
## 0.12055336 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1 -0.3276033 -0.4819336  0.4539433 -0.2723291  0.4033614 -0.4554552
## 2  0.6891277 -0.4872402  1.0922021 -0.2723291  0.9675559 -0.5127412
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681
## 4 -0.3997875 -0.1254255 -0.6189159 -0.2723291 -0.7089634 -0.1617920
## 5 -0.3732407 -0.1422288 -0.8310017 -0.2723291 -0.2005297  1.6901170
## 6  1.9373059 -0.4872402  1.0149946 -0.2723291  0.9805230 -0.2986618
## 7 -0.4142556  2.3574118 -1.1833581 -0.2077864 -1.1903591  0.7260513
##          age        dis          rad        tax     ptratio      black
## 1  0.7139676 -0.5156239 -0.599806921 -0.2986652  0.20347210  0.0769363
## 2  0.7518833 -0.7972862  1.533397731  1.5440835  0.80324049  0.1646510
## 3  0.3721322 -0.4033382  0.001081444 -0.0975633 -0.39245849  0.1715427
## 4 -0.7541779  0.6240671 -0.573249961 -0.6985399 -0.09292879  0.3605290
## 5  0.1841367 -0.3232389 -0.511800977 -0.8155263 -1.10522083  0.3496304
## 6  0.7720736 -0.8769466  1.659602895  1.5294129  0.80577843 -3.0248555
## 7 -1.4158153  1.6302712 -0.671220280 -0.5521444 -0.82906375  0.3536252
##        lstat        medv
## 1  0.4437738 -0.43288856
## 2  0.8098978 -0.65415404
## 3 -0.1643525  0.57334091
## 4 -0.4176324  0.01266366
## 5 -0.9616241  1.71992796
## 6  1.1866642 -1.08972128
## 7 -0.9679345  0.81083758
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3         LD4         LD5
## crim    -0.055858880 -0.28280174  0.39942789  0.62394305  0.07126916
## zn      -0.256103484  0.17091712  1.57768449 -0.60309927 -0.13217489
## indus    0.079839746 -0.47024036 -0.16909868 -0.26420738 -0.52074287
## chas     5.832830141  0.13558415  0.33654481  0.06117107 -0.24021698
## nox     -0.008708387  0.52783295 -0.21825248 -0.10482010  0.13145221
## rm      -0.082216127 -0.06752441  0.09541076  0.18047678  0.61851276
## age      0.084401663 -0.19943772 -0.48493022  0.12288716  0.40058616
## dis      0.057464812  0.42853710  0.33454359 -0.10425677 -0.25117932
## rad      0.178314081 -1.67575671  0.08941298 -0.89338698  0.56080972
## tax      0.075888055 -0.61822344  0.84585974 -0.86928072  0.03445999
## ptratio  0.017381740 -0.15088883 -0.06303904  0.08590537 -0.28534418
## black    0.058211177  0.84962506 -0.76578638 -1.82606194  0.15371287
## lstat   -0.123096250 -0.19683160  0.04588153 -0.03043413  0.30872764
## medv    -0.195627826  0.33721840  0.09174908  0.15224414  0.89521721
##                 LD6
## crim     0.08236240
## zn      -1.25731674
## indus   -0.27992287
## chas     0.06410065
## nox     -0.51458192
## rm      -0.10578655
## age     -0.82552088
## dis      0.31487537
## rad      1.64313640
## tax     -0.61832490
## ptratio -0.53422706
## black    0.10592104
## lstat   -0.20241633
## medv    -0.22594636
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6 
## 0.6103 0.2326 0.0749 0.0498 0.0175 0.0148
# the function for lda biplot arrows
lda.arrows_j <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes_j <- as.numeric(km_j$cluster)

# plot the lda results
plot(lda.fit_j, dimen = 2, col = classes_j, pch = classes_j)
lda.arrows_j(lda.fit_j, myscale = 1)

The data is separated fairly well into the seven clusters. This time the Charles River dummy variable (chas) has by far the greatest weight in the LD1 dimension (coefficient 5.83). In LD2 the index of accessibility to radial highways (rad, -1.68) has the greatest weight, followed by black (0.85) and nox (0.53).
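The dominant coefficients can be verified directly from the scaling matrix; a sketch, assuming the lda.fit_j object from above:

```r
# three largest LD1 and LD2 coefficients in absolute value
coefs <- lda.fit_j$scaling
head(coefs[order(-abs(coefs[, "LD1"])), "LD1"], 3)
head(coefs[order(-abs(coefs[, "LD2"])), "LD2"], 3)
```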

Super Bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Let's find k-clusters for train set

n_k <- nrow(Boston)

ind_k <- sample(n_k, size = n_k*0.8)


train_k <- Boston[ind_k,]

train_k <- scale(train_k)


twcss2 <- sapply(1:k_max, function(k){kmeans(stats::na.omit(train_k), k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss2, geom = 'line') # 2

# k-means clustering
km_k <-kmeans(train_k, centers = 2)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_k$cluster)
# Let's increase the number of clusters to 4

km_l <-kmeans(train_k, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_l$cluster)

In this part we can compare three 3D scatterplots whose basis vectors come from the LDA model fitted above. All the points in the three plots share the same coordinates; only the colours differ. The first plot is coloured by the four crime rate classes, the second by the 2-means clustering and the third by the 4-means clustering. The colour distributions of the k-means plots do not correspond to the colour distribution of the crime rate classes. One possible explanation is that the crime rate does not carry as much weight in the clustering as some other variables, since it correlates relatively weakly with them (as seen above).


Analysis, IODS week 5

Here are the packages used in this session:

# REMOVE ALL OBJECTS
rm(list=ls())

library(GGally)
library(ggplot2)
library(dplyr)
library(corrplot)
library(tidyr)
library(FactoMineR)

Human data

Import data

Let's import the data and take a quick look at it.

human <- read.csv("C:/Users/juhol/OneDrive/Documents/GitHub/IODS-project/data/human.csv", sep  =",", header = T, row.names = 1)


# look at the (column) names of human
names(human)
## [1] "life_exp"                 "exp_years_education"     
## [3] "gni"                      "maternal_mortality_ratio"
## [5] "adolescent_birth_rate"    "female_parliament"       
## [7] "sec_edu_ratio"            "labour_rate_ratio"
# look at the structure of human
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ life_exp                : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ exp_years_education     : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ gni                     : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ maternal_mortality_ratio: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adolescent_birth_rate   : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ female_parliament       : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ sec_edu_ratio           : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labour_rate_ratio       : num  0.891 0.819 0.825 0.884 0.829 ...
# print out summaries of the variables
summary(human)
##     life_exp     exp_years_education      gni        
##  Min.   :49.00   Min.   : 5.40       Min.   :   581  
##  1st Qu.:66.30   1st Qu.:11.25       1st Qu.:  4198  
##  Median :74.20   Median :13.50       Median : 12040  
##  Mean   :71.65   Mean   :13.18       Mean   : 17628  
##  3rd Qu.:77.25   3rd Qu.:15.20       3rd Qu.: 24512  
##  Max.   :83.50   Max.   :20.20       Max.   :123124  
##  maternal_mortality_ratio adolescent_birth_rate female_parliament
##  Min.   :   1.0           Min.   :  0.60        Min.   : 0.00    
##  1st Qu.:  11.5           1st Qu.: 12.65        1st Qu.:12.40    
##  Median :  49.0           Median : 33.60        Median :19.30    
##  Mean   : 149.1           Mean   : 47.16        Mean   :20.91    
##  3rd Qu.: 190.0           3rd Qu.: 71.95        3rd Qu.:27.95    
##  Max.   :1100.0           Max.   :204.80        Max.   :57.50    
##  sec_edu_ratio    labour_rate_ratio
##  Min.   :0.1717   Min.   :0.1857   
##  1st Qu.:0.7264   1st Qu.:0.5984   
##  Median :0.9375   Median :0.7535   
##  Mean   :0.8529   Mean   :0.7074   
##  3rd Qu.:0.9968   3rd Qu.:0.8535   
##  Max.   :1.4967   Max.   :1.0380

In the first part of the analysis we take a brief look at data from the United Nations Human Development Report 2015. We use a subset of the original data containing the variables life expectancy at birth, expected years of schooling, GNI per capita, maternal mortality ratio, adolescent birth rate, female share of parliamentary seats, female-to-male ratio of at least secondary education, and female-to-male ratio of labour force participation rates. The dataset has 155 observations, each representing a country.

All variables contain only numerical values, and from the summary we can see that they vary over quite different ranges. Let's look at the data visually.

# visualize the 'human' variables


my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method=loess, fill="red", color="red", ...) +
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}



ggpairs(human, lower = list(continuous = my_fn))

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

The distributions of life expectancy at birth, GNI per capita, adolescent birth rate and the female-to-male labour force participation ratio are highly skewed; the remaining variables look approximately normally distributed. In the plot above two fitted lines are drawn for each variable pair: the blue OLS line, representing a possible linear relation, and the red loess curve, a locally averaged smooth. The correlations likewise measure linear relations. A clear linear or log-linear relation can be seen between many variable pairs, for example between life expectancy at birth and expected years of schooling (corr 0.78), between maternal mortality ratio and life expectancy at birth (corr -0.86; of course maternal mortality directly affects life expectancy) and between adolescent birth rate and life expectancy at birth (corr -0.73).

PCA not standardized for human data

Next we apply principal component analysis (PCA) to the human data. First we analyse the non-standardized data, and then compare the results with those from the standardized data. PCA is a popular approach for reducing the dimensionality of data. Its main idea is to find the principal components, i.e. the directions in which the observations vary the most: the first principal component is the direction of greatest variance, the second principal component is the direction of second greatest variance orthogonal to the first, and so on.

pca_human_not_std <- prcomp(human)

sum_pca_human_not_std <- summary(pca_human_not_std)

# rounded percentages of variance captured by each PC
pca_pr_not_std <- round(100*sum_pca_human_not_std$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr_not_std
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
# create object pc_lab to be used as axis labels
pc_lab_not_std <- paste0(names(pca_pr_not_std), " (", pca_pr_not_std, "%)")

# draw a biplot
biplot(pca_human_not_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_not_std[1], ylab = pc_lab_not_std[2])

The first principal component captures nearly all of the variance (99.99%), and GNI is almost the only variable spanning PC1. This follows from the fact that GNI ranges from 581 to 123,124, while the ranges of the other variables are many times smaller; PCA on unscaled data is dominated by the variables with the largest variance. That is why we have to standardize the data before using PCA.
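The dominance of GNI is visible in the raw variances; a sketch, assuming the human data frame loaded above:

```r
# variances of the unscaled variables; GNI dwarfs the others by
# several orders of magnitude, so it alone spans PC1
sort(sapply(human, var), decreasing = TRUE)
```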

PCA standardized for human data

human_std <- scale(human)

pca_human <- prcomp(human_std)

sum_pca_human <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 53.605 16.237  9.571  7.583  5.477  3.595  2.634  1.298
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Now the principal components capture reasonable amounts of variation: PC1 captures 53.6% and PC2 16.2%, so cumulatively the first two PCs capture 69.8% of the variation. In the biplot above the x-axis represents the dimension spanned by PC1 and the y-axis the dimension spanned by PC2. Each original variable can be expressed as a linear combination of the principal components. For example, the loading of life expectancy on PC1 is -0.44, while maternal_mortality_ratio points in the opposite direction on PC1 with loading 0.44. In the space spanned by the first two principal components, life expectancy and the labour rate ratio are nearly orthogonal, reflecting the weak correlation (-0.14) between the two variables. In the space spanned by all eight principal components the coordinate vector of life expectancy is \([-0.44372240, 0.02530473, -0.10991305, 0.05834819, -0.1628935, 0.42242796, -0.43406432, -0.62737008]\).
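The loadings quoted above can be read directly from the rotation matrix returned by prcomp; a sketch, assuming the pca_human object from above:

```r
# rows are variables, columns are principal components;
# life_exp and maternal_mortality_ratio load on PC1 with opposite signs
round(pca_human$rotation[c("life_exp", "maternal_mortality_ratio"), 1:2], 2)
```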

In the biplot the horizontal dimension (PC1) can be seen as representing the wealth and overall development of a country, and the vertical dimension (PC2) as representing gender equality. In the south-west corner of the biplot we find the Nordic countries, and in the opposite corner countries such as Yemen and Afghanistan.

Scaling the data is necessary for this analysis: PCA is based solely on the variances of the variables, and we are normally not interested in differences of variance that merely reflect the measurement scales.

Tea time

Let's import the tea data used in the FactoMineR package.

tea <- read.table("http://factominer.free.fr/book/tea.csv",header=TRUE,sep=";")

colnames(tea)
##  [1] "breakfast"         "afternoon.tea"     "evening"          
##  [4] "after.lunch"       "after.dinner"      "anytime"          
##  [7] "home"              "work"              "tearoom"          
## [10] "friends"           "restaurant"        "pub"              
## [13] "variety"           "how"               "sugar"            
## [16] "format"            "place.of.purchase" "type"             
## [19] "sex"               "profession"        "sport"            
## [22] "age"               "age_Q"             "frequency"        
## [25] "exotic"            "spirituality"      "good.for.health"  
## [28] "diuretic"          "friendliness"      "iron.absorption"  
## [31] "feminine"          "refined"           "slimming"         
## [34] "stimulant"         "relaxant"          "no.effect.health"
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast        : Factor w/ 2 levels "breakfast","not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ afternoon.tea    : Factor w/ 2 levels "afternoon.tea",..: 2 2 1 2 2 2 1 1 1 2 ...
##  $ evening          : Factor w/ 2 levels "evening","not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ after.lunch      : Factor w/ 2 levels "after.lunch",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ after.dinner     : Factor w/ 2 levels "after.dinner",..: 2 2 1 1 2 1 2 2 2 2 ...
##  $ anytime          : Factor w/ 2 levels "anytime","not.anytime": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home             : Factor w/ 2 levels "home","not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work             : Factor w/ 2 levels "not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom          : Factor w/ 2 levels "not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends          : Factor w/ 2 levels "friends","not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ restaurant       : Factor w/ 2 levels "not.restaurant",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub              : Factor w/ 2 levels "not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ variety          : Factor w/ 3 levels "black","flavoured",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ how              : Factor w/ 4 levels "lemon","milk",..: 3 2 3 3 3 3 3 2 2 3 ...
##  $ sugar            : Factor w/ 2 levels "not.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ format           : Factor w/ 3 levels "loose","sachet",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ place.of.purchase: Factor w/ 3 levels "specialist.shop",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ type             : Factor w/ 6 levels "cheapest","known.brand",..: 5 6 6 6 6 4 6 6 3 3 ...
##  $ sex              : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ profession       : Factor w/ 7 levels "employee","management",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ sport            : Factor w/ 2 levels "not.sporty","sporty": 2 2 2 1 2 2 2 2 2 1 ...
##  $ age              : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ age_Q            : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency        : Factor w/ 4 levels "1 to 2/week",..: 2 2 4 2 4 2 3 1 4 4 ...
##  $ exotic           : Factor w/ 2 levels "exotic","not.exotic": 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality     : Factor w/ 2 levels "not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ good.for.health  : Factor w/ 2 levels "good for health",..: 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic         : Factor w/ 2 levels "diuretic","not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness     : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption  : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine         : Factor w/ 2 levels "feminine","not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ refined          : Factor w/ 2 levels "not.refined",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming         : Factor w/ 2 levels "not.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ stimulant        : Factor w/ 2 levels "not.stimulant",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ relaxant         : Factor w/ 2 levels "not.relaxant",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ no.effect.health : Factor w/ 2 levels "no.effect.health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           afternoon.tea        evening   
##  breakfast    :144   afternoon.tea  :169   evening    :103  
##  not.breakfast:156   Not.afternoon.t:131   not.evening:197  
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##           after.lunch            after.dinner        anytime   
##  after.lunch    : 44   after.dinner    : 21   anytime    :103  
##  not.after.lunch:256   not.after.dinner:279   not.anytime:197  
##                                                                
##                                                                
##                                                                
##                                                                
##                                                                
##        home           work            tearoom           friends   
##  home    :291   not.work:213   not.tearoom:242   friends    :196  
##  not.home:  9   work    : 87   tearoom    : 58   not.friends:104  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##           restaurant       pub           variety               how     
##  not.restaurant:221   not.pub:237   black    : 74   lemon        : 33  
##  restaurant    : 79   pub    : 63   flavoured:193   milk         : 63  
##                                     green    : 33   nothing.added:195  
##                                                     other        :  9  
##                                                                        
##                                                                        
##                                                                        
##        sugar              format                  place.of.purchase
##  not.sugar:155   loose       : 36   specialist.shop        : 30    
##  sugar    :145   sachet      :170   supermarket            :192    
##                  sachet+loose: 94   supermarket+specialist.: 78    
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##           type     sex                 profession        sport    
##  cheapest   :  7   F:178   employee         :59   not.sporty:121  
##  known.brand: 95   M:122   management       :40   sporty    :179  
##  luxury     : 53           manual labourer  :12                   
##  shop.brand : 21           other work       :20                   
##  unknown    : 12           senior management:35                   
##  varies     :112           student          :70                   
##                            unemployed       :64                   
##       age             age_Q              frequency          exotic   
##  Min.   :15.00   15-24   :92   1 to 2/week    : 44   exotic    :142  
##  1st Qu.:23.00   25-34   :69   1/day          : 95   not.exotic:158  
##  Median :32.00   35-44   :40   3 to 6/week    : 34                   
##  Mean   :37.05   45-59   :61   more than 2/day:127                   
##  3rd Qu.:48.00   60 and +:38                                         
##  Max.   :90.00                                                       
##                                                                      
##            spirituality            good.for.health         diuretic  
##  not.spirituality:206   good for health    :210    diuretic    :174  
##  spirituality    : 94   not.good for health: 90    not.diuretic:126  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##            friendliness            iron.absorption         feminine  
##  friendliness    :242   iron absorption    : 31    feminine    :129  
##  not.friendliness: 58   not.iron absorption:269    not.feminine:171  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##         refined            slimming           stimulant  
##  not.refined: 85   not.slimming:255   not.stimulant:184  
##  refined    :215   slimming    : 45   stimulant    :116  
##                                                          
##                                                          
##                                                          
##                                                          
##                                                          
##          relaxant                    no.effect.health
##  not.relaxant:113   no.effect.health         : 66    
##  relaxant    :187   not.without.effect.health:234    
##                                                      
##                                                      
##                                                      
##                                                      
## 

In the Tea data most of the variables are factors. The data contains 36 variables and 300 observations. Let’s have a visual overview of the data.

# Splitting the plot

colnames(tea)
##  [1] "breakfast"         "afternoon.tea"     "evening"          
##  [4] "after.lunch"       "after.dinner"      "anytime"          
##  [7] "home"              "work"              "tearoom"          
## [10] "friends"           "restaurant"        "pub"              
## [13] "variety"           "how"               "sugar"            
## [16] "format"            "place.of.purchase" "type"             
## [19] "sex"               "profession"        "sport"            
## [22] "age"               "age_Q"             "frequency"        
## [25] "exotic"            "spirituality"      "good.for.health"  
## [28] "diuretic"          "friendliness"      "iron.absorption"  
## [31] "feminine"          "refined"           "slimming"         
## [34] "stimulant"         "relaxant"          "no.effect.health"
# subset 1: when and where tea is drunk
tea_subset_1 <- dplyr::select(tea, breakfast:pub)

# subset 2: how tea is drunk and respondent background (keep only the age classes, age_Q)
tea_subset_2 <- dplyr::select(tea, variety:frequency) %>% dplyr::select(-age)

# subset 3: perceived effects of and reasons for drinking tea
tea_subset_3 <- dplyr::select(tea, exotic:no.effect.health)

# visualize the dataset

gather(tea_subset_1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

gather(tea_subset_2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

gather(tea_subset_3) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

The last 12 variables are all categorical variables on a “yes - no” dimension. They all represent reasons to consume tea. In this part of the analysis we are interested in these variables.

# Let's select a subset of the data: why do people drink tea?

tea_time <- dplyr::select(tea, exotic, spirituality, good.for.health, diuretic, friendliness, iron.absorption, feminine, refined, slimming, stimulant, relaxant, no.effect.health)

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

For categorical variables there is a technique closely corresponding to PCA: Multiple Correspondence Analysis (MCA). More about MCA can be read from here.

# multiple correspondence analysis with MCA() from the FactoMineR package
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.141   0.121   0.104   0.098   0.096   0.081
## % of var.             14.140  12.066  10.407   9.788   9.606   8.136
## Cumulative % of var.  14.140  26.207  36.613  46.401  56.007  64.143
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.073   0.066   0.063   0.059   0.052   0.045
## % of var.              7.271   6.626   6.280   5.905   5.237   4.537
## Cumulative % of var.  71.414  78.040  84.321  90.226  95.463 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   | -0.691  1.124  0.425 | -0.345  0.329  0.106 | -0.384
## 2                   | -0.307  0.222  0.081 |  0.155  0.066  0.020 | -0.565
## 3                   | -0.215  0.109  0.071 | -0.518  0.742  0.411 | -0.255
## 4                   | -0.123  0.036  0.015 | -0.284  0.223  0.079 |  0.539
## 5                   | -0.190  0.085  0.037 | -0.038  0.004  0.001 |  0.321
## 6                   | -0.608  0.872  0.357 | -0.578  0.922  0.322 | -0.036
## 7                   | -0.608  0.872  0.357 | -0.578  0.922  0.322 | -0.036
## 8                   |  0.048  0.005  0.004 | -0.448  0.554  0.347 |  0.161
## 9                   | -0.612  0.883  0.368 | -0.205  0.116  0.041 |  0.308
## 10                  | -0.215  0.109  0.071 | -0.518  0.742  0.411 | -0.255
##                        ctr   cos2  
## 1                    0.473  0.131 |
## 2                    1.023  0.274 |
## 3                    0.209  0.100 |
## 4                    0.931  0.285 |
## 5                    0.329  0.105 |
## 6                    0.004  0.001 |
## 7                    0.004  0.001 |
## 8                    0.083  0.045 |
## 9                    0.303  0.093 |
## 10                   0.209  0.100 |
## 
## Categories (the 10 first)
##                        Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2
## exotic              |  0.214  1.283  0.041  3.515 |  0.394  5.087  0.140
## not.exotic          | -0.193  1.153  0.041 -3.515 | -0.355  4.572  0.140
## not.spirituality    | -0.236  2.253  0.122 -6.039 | -0.028  0.037  0.002
## spirituality        |  0.517  4.937  0.122  6.039 |  0.061  0.081  0.002
## good for health     |  0.314  4.064  0.230  8.290 | -0.350  5.915  0.285
## not.good for health | -0.732  9.483  0.230 -8.290 |  0.816 13.801  0.285
## diuretic            |  0.406  5.637  0.228  8.252 |  0.122  0.594  0.020
## not.diuretic        | -0.561  7.785  0.228 -8.252 | -0.168  0.820  0.020
## friendliness        |  0.156  1.161  0.102  5.519 | -0.008  0.004  0.000
## not.friendliness    | -0.652  4.842  0.102 -5.519 |  0.034  0.015  0.000
##                     v.test    Dim.3    ctr   cos2 v.test  
## exotic               6.466 |  0.426  6.871  0.163  6.979 |
## not.exotic          -6.466 | -0.383  6.175  0.163 -6.979 |
## not.spirituality    -0.712 | -0.236  3.072  0.122 -6.050 |
## spirituality         0.712 |  0.518  6.731  0.122  6.050 |
## good for health     -9.239 | -0.200  2.243  0.093 -5.284 |
## not.good for health  9.239 |  0.467  5.234  0.093  5.284 |
## diuretic             2.475 | -0.326  4.944  0.147 -6.630 |
## not.diuretic        -2.475 |  0.451  6.828  0.147  6.630 |
## friendliness        -0.287 | -0.014  0.012  0.001 -0.485 |
## not.friendliness     0.287 |  0.057  0.051  0.001  0.485 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## exotic              | 0.041 0.140 0.163 |
## spirituality        | 0.122 0.002 0.122 |
## good.for.health     | 0.230 0.285 0.093 |
## diuretic            | 0.228 0.020 0.147 |
## friendliness        | 0.102 0.000 0.001 |
## iron.absorption     | 0.068 0.008 0.011 |
## feminine            | 0.310 0.009 0.007 |
## refined             | 0.215 0.030 0.090 |
## slimming            | 0.275 0.055 0.003 |
## stimulant           | 0.030 0.258 0.127 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

As in PCA, the dimensions are ordered by the amount of variance they capture. Dim 1 captures 14.1 % of the overall variance. Cumulatively, the first two dimensions account for only 26.2 % of the variance, meaning that the major part of the data’s overall variance is captured by the remaining dimensions.
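The per-dimension variances can also be inspected directly from the model object. A minimal sketch, assuming the FactoMineR model `mca` fitted above:

```r
# eigenvalues table: eigenvalue, % of variance and cumulative % per dimension
mca$eig

# scree plot of the percentage of variance captured by each dimension
barplot(mca$eig[, "percentage of variance"],
        names.arg = rownames(mca$eig),
        las = 2, ylab = "% of variance")
```

A scree plot like this makes it easy to see that no single dimension dominates, which is consistent with the flat cumulative percentages in the summary.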

From the biplot above we can see that most of the categorical variable values lie quite near the origin in our 2D space. This means that the two-dimensional MCA model does not explain a major part of the variation in our tea data. Still, some values in the data do form clusters; for example, the group of values no effect on health, not good for health and not relaxant can be distinguished (an interpretation could be “the cluster of pessimists”). Dim 1 could be seen as an “effect of tea” dimension, where values on the left do not identify any effect of tea and values on the right are more devoted to positive effects of tea.
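The same variable map can be drawn in a ggplot2 style with the factoextra package. This is only an alternative sketch (factoextra is an assumption here, not used elsewhere in this report), but colouring categories by their contribution makes the clusters easier to spot:

```r
# ggplot2-style MCA variable map; categories coloured by their
# contribution to the first two dimensions, labels pushed apart
library(factoextra)
fviz_mca_var(mca, repel = TRUE, col.var = "contrib")
```

Categories far from the origin with a high contribution colour are the ones driving the interpretation of Dim 1 and Dim 2.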